Objective
Cluster high-flow event characteristics and antecedent watershed conditions to evaluate how these factors converge as flux regimes (clusters) to produce variability in:
1. Event NO3 yields
2. Event SRP yields
3. Event turbidity yields
4. Event NO3:SRP yield ratios
Select variables to keep
Per K Underwood: if two variables are strongly correlated (negatively or positively) they can effectively “double-weight” a particular factor important in driving clustering; thus, keep just one of the variables to serve as a proxy for that factor
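The screening idea above can be sketched in a few lines. This is a minimal Python illustration, not the code behind this report: the function name `correlated_pairs`, the toy data, and the choice of Pearson correlation are all assumptions, and the default cutoff simply mirrors the 70% threshold used in the decisions that follow.

```python
import numpy as np

def correlated_pairs(data, names, cutoff=0.70):
    """Return (name_a, name_b, r) for every column pair with |r| above the cutoff."""
    r = np.corrcoef(data, rowvar=False)  # Pearson correlation between columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > cutoff:
                pairs.append((names[i], names[j], round(float(r[i, j]), 3)))
    # Strongest relationships first, regardless of sign
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = -a + 0.1 * rng.normal(size=200)  # strongly negatively correlated with a
c = rng.normal(size=200)             # independent of a and b
toy = np.column_stack([a, b, c])
flagged = correlated_pairs(toy, ["var_a", "var_b", "var_c"])
```

Each flagged pair is a candidate for keeping only one variable as the proxy; which one to keep remains a judgment call, as the notes below show.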

Decisions for eliminating variables w/ correlations >70%
These were used for the 2020-12-10 run:
- These decisions were tough to make and need review
- Trying to find a VWC variable that correlates well with GW level so that I can remove GW level vars (no GW data in 2017)
- gw_4d_allWells and VWC_pre_wet_30cm are the most highly correlated pair (84%), though less so than at Hungerford (94.5%). VWC_pre_wet_15cm and gw_4d_allWells are slightly less correlated (80.2%), and again much less so than at Hungerford (94.2%). At Hungerford I kept only VWC_pre_wet_15cm as the GW proxy, even though the 30cm equivalent was slightly better correlated, in order to keep all soil vars (e.g., temp, DO, redox); I’m going to take the same approach at Wade (keep only VWC_pre_wet_15cm for now)


Decisions for eliminating variables w/ correlations >70% (continued)
- Tough decisions that need review (continued)
- SoilTemp_pre_wet_15cm and VWC_pre_wet_15cm are correlated (73.7%), but they’re pretty different and close enough to 70% correlation cutoff; will leave both for now and could try versions w/ one or the other to see if it matters
- q_event_max and q_mm are highly correlated (74.3%), but less than at Hungerford (83.8%); let’s drop q_event_max for now, because?
- DOY and NO3_1d (mean stream NO3 conc 1 day prior to event start) are negatively correlated (-73.7%), keeping both for now, but should I?
- rain_event_total_mm and API_4d are correlated (71.9%), leaving both for now, should I?
- q_4d and DO_pre_wet_15cm are negatively correlated (-71.5%), but they’re pretty different and close enough to 70% correlation cutoff; will leave both for now and could try versions w/ one or the other to see if it matters
- rain_event_total_mm and turb_event_max are correlated (70.4%); but the relationship isn’t great and it’s right at my arbitrary 70% correlation cutoff so leaving both for now
- I feel OK about these decisions, but they should be reviewed as well
- If the 1-d and 4-d values for a variable are highly correlated, use the 4-d value
- gw_1d_allWells and gw_4d_allWells are highly correlated (90.9%); remove gw_1d_allWells
- MET variables
- airT_1d and airT_4d are highly correlated (92.7%); removing airT_1d
- airT_4d is highly correlated with both dewPoint_4d (96.3%) and dewPoint_1d (92.4%); removing both dewPoint vars
- airT_4d and SoilTemp_pre_wet_15cm are highly correlated (92.7%); drop airT_4d b/c we still have diff_airT_soilT & soilT follows similar annual arc
- solarRad_1d and solarRad_4d are highly correlated (72.7%); drop solarRad_1d per rule above
- Q
- q_1d and q_4d are highly correlated (91.1%); per the rule above, keep q_4d
- q_event_delta & q_event_max are highly correlated (87.3%); I kept q_event_max for Hungerford, so let’s keep it vs delta
- Drop q_event_dQRate_cmsPerHr b/c it’s confusing and hopefully q_event_delta or rain intensity will capture this
- Rain
- Drop all the rain_Xd vars, b/c API_4d should cover this, though would be interesting to test how many days pre-event (e.g., 4 days for API) matters
- Unlike at Hungerford, where they were correlated (74.4%), rain_int_mmPERmin_mean and rain_int_mmPERmin_max are less correlated at Wade (54.6%), so leaving both
- Redox
- If redox variables prove not to be important or they are highly correlated with another variable, we can remove them and increase n obs
- Stream
- turb_1d proved not to be useful in driving clusters in the SOM, so I removed it
Look at correlations again after dropping variables
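A re-check after dropping variables could look like the sketch below. It is an illustration in Python, not the code behind this report; the helper names `drop_columns` and `max_abs_offdiag` are hypothetical, and the values attached to the borrowed variable names are random toy numbers.

```python
import numpy as np

def drop_columns(data, names, to_drop):
    """Remove the named columns and return the reduced data and names."""
    keep = [i for i, n in enumerate(names) if n not in to_drop]
    return data[:, keep], [names[i] for i in keep]

def max_abs_offdiag(data):
    """Largest absolute pairwise correlation among the remaining columns."""
    r = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(r, 0.0)  # ignore each variable's correlation with itself
    return float(np.max(np.abs(r)))

# Toy data (hypothetical values): a 1-d and a 4-d version of the same signal
rng = np.random.default_rng(1)
base = rng.normal(size=200)
q_4d = base
q_1d = base + 0.3 * rng.normal(size=200)  # closely tracks q_4d
api_4d = rng.normal(size=200)             # unrelated to the other two
toy = np.column_stack([q_1d, q_4d, api_4d])
names = ["q_1d", "q_4d", "API_4d"]

before = max_abs_offdiag(toy)
# Apply the "keep the 4-d value" rule, then look at correlations again
reduced, kept_names = drop_columns(toy, names, {"q_1d"})
after = max_abs_offdiag(reduced)
```

If `after` is still above the cutoff, another pair needs a decision.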

Self-organizing map (SOM)
Prepare data & set up grid/lattice dimensions
We’re only using complete observations/rows (no NAs in any columns)
According to the heuristic rule from Vesanto 2000, the number of grid nodes (grid size) = 5 * sqrt(n), where n is the number of observations
To determine the shape of the grid (ratio of columns to rows), we use the ratio of the first two eigenvalues of the input data set as recommended by Park et al. 2006
## [1] No. of complete observations: 97 out of 120 observations
## [1] No. of Vesanto nodes: 49
## [1] Ratio of columns to rows: 1.3
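The three numbers above can be reproduced in outline as follows. This is a hedged Python sketch, not the hidden code: the function names are hypothetical, taking eigenvalues from the covariance matrix of standardized data is an assumption, and so is the way the ratio is turned into whole rows and columns in `grid_shape`.

```python
import math
import numpy as np

def complete_rows(data):
    """Keep only rows with no missing values in any column."""
    return data[~np.isnan(data).any(axis=1)]

def vesanto_nodes(n_obs):
    """Heuristic grid size: 5 * sqrt(n), rounded to a whole number of nodes."""
    return round(5 * math.sqrt(n_obs))

def eigen_ratio(data):
    """Ratio of the two largest eigenvalues (assumed: covariance of z-scored data)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))  # ascending order
    return float(eigvals[-1] / eigvals[-2])

def grid_shape(n_nodes, ratio):
    """Pick whole rows/cols so that cols/rows is near `ratio` (assumed rounding)."""
    rows = max(1, round(math.sqrt(n_nodes / ratio)))
    cols = max(1, round(n_nodes / rows))
    return rows, cols

# Toy data (random values; only the row counts echo the output above)
rng = np.random.default_rng(0)
toy = rng.normal(size=(120, 4))
toy[:23, 0] = np.nan                 # 23 incomplete rows
complete = complete_rows(toy)
n_nodes = vesanto_nodes(len(complete))
ratio = eigen_ratio(complete)
rows, cols = grid_shape(n_nodes, 1.3)
```

Because rows and columns must be whole numbers, the resulting grid can have slightly more or fewer nodes than the heuristic target, which is one reason to try a suite of configurations.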
Run SOM for a suite of grid/lattice configurations, # of nodes, and # of clusters
Code courtesy of Kristen Underwood (hidden)
## [1] Topology: hexagonal
## [1] Data normalization method used: L2norm
## [1] Weighting method used: noPCA
## [1] No. of iterations: 1000
## [1] alphaCrs used: 0.05
## [1] alphaFin used: 1000
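For readers without access to the hidden code, here is a bare-bones SOM training loop in Python that shows the general mechanics only. It is not Kristen Underwood's implementation. Everything here is an assumption: the function names, L2 normalization applied per variable (column), reading alphaCrs as the starting learning rate, the linear decay schedules, and the Gaussian neighborhood.

```python
import numpy as np

def l2_normalize(data):
    """Scale each column (variable) to unit Euclidean length (assumed convention)."""
    norms = np.linalg.norm(data, axis=0, keepdims=True)
    return data / np.where(norms == 0, 1.0, norms)

def hex_coords(rows, cols):
    """Node positions on a hexagonal lattice (odd rows shifted half a step)."""
    pts = [(c + 0.5 * (r % 2), r * np.sqrt(3) / 2)
           for r in range(rows) for c in range(cols)]
    return np.array(pts)

def train_som(data, rows, cols, n_iter=1000, alpha0=0.05, seed=0):
    """Online SOM training; returns a (rows * cols, n_features) weight matrix."""
    rng = np.random.default_rng(seed)
    coords = hex_coords(rows, cols)
    # Initialise node weights from randomly chosen observations
    weights = data[rng.integers(0, len(data), size=rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha0 * (1.0 - frac)               # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # radius shrinks toward 0.5
        x = data[rng.integers(0, len(data))]        # one random observation
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
        d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)      # lattice distances
        h = np.exp(-d2 / (2.0 * sigma ** 2))                  # neighborhood weights
        weights += alpha * h[:, None] * (x - weights)
    return weights

# Toy data (random values, for illustration only)
rng = np.random.default_rng(42)
normed = l2_normalize(rng.normal(size=(97, 5)))
weights = train_som(normed, rows=7, cols=7)
```

Clustering the trained node weights (e.g., into 3 to 6 clusters) is a separate step not shown here.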
Choose the best SOM run based on non-parametric F-stat and quantization error
We want to maximize npF (ratio of between-cluster to within-cluster variance) and minimize QE (mean distance between each data vector and its best-matching unit)
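To make the two criteria concrete, here is a small Python sketch. QE follows the definition given above. For npF, the exact formula in the hidden code is not shown, so a Euclidean pseudo-F (between-cluster over within-cluster sums of squares, each divided by its degrees of freedom) stands in as an assumption. The function names and toy data are hypothetical.

```python
import numpy as np

def quantization_error(data, weights):
    """Mean distance between each data vector and its best-matching unit."""
    d = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def pseudo_f(data, labels):
    """Between- over within-cluster variability (stand-in for npF)."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, a = len(data), len(groups)
    ss_total = np.sum((data - data.mean(axis=0)) ** 2)
    ss_within = sum(
        np.sum((data[labels == g] - data[labels == g].mean(axis=0)) ** 2)
        for g in groups
    )
    ss_between = ss_total - ss_within
    return float((ss_between / (a - 1)) / (ss_within / (n - a)))

# Toy data (hypothetical): two tight, well-separated groups in 2-D
rng = np.random.default_rng(0)
grp_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
grp_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
toy = np.vstack([grp_a, grp_b])
good_labels = np.repeat([0, 1], 20)   # matches the true grouping
bad_labels = np.arange(40) % 2        # mixes the two groups evenly

f_good = pseudo_f(toy, good_labels)
f_bad = pseudo_f(toy, bad_labels)
qe = quantization_error(toy, np.array([[0.0, 0.0], [5.0, 5.0]]))
```

A clustering that matches the real structure gives a much larger ratio than one that mixes groups, and nodes sitting close to the data give a small QE.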
Here are the top 33% of runs based on npF
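One way to shortlist runs is sketched below in Python. The ranking logic is an assumption, not taken from the hidden code: keep the top 33% of runs by npF (rounding up), then order that shortlist by ascending QE. The function name `top_runs` and all run values are hypothetical.

```python
import math

def top_runs(runs, frac=0.33):
    """Keep the top `frac` of runs by npF, then order them by ascending QE."""
    n_keep = max(1, math.ceil(len(runs) * frac))
    by_npf = sorted(runs, key=lambda r: r["npF"], reverse=True)[:n_keep]
    return sorted(by_npf, key=lambda r: r["QE"])

# Hypothetical runs, for illustration only
runs = [
    {"run": 1, "npF": 50.0, "QE": 0.11},
    {"run": 2, "npF": 54.0, "QE": 0.12},
    {"run": 3, "npF": 40.0, "QE": 0.10},
    {"run": 4, "npF": 30.0, "QE": 0.11},
    {"run": 5, "npF": 20.0, "QE": 0.15},
    {"run": 6, "npF": 10.0, "QE": 0.09},
]
ranked = top_runs(runs)
ranked_ids = [r["run"] for r in ranked]
```

With these toy values, runs 1 and 2 survive the npF cut, and run 1 comes first because its QE is lower.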
Examine boxplots of independent variables by cluster

Examine how antecedent and event conditions converge to influence N & P flux regimes:
How do our results differ if we choose the 2nd best SOM run?
## [1] The 2nd best run was:
| Run | rows | cols | Nodes | Clusters | npF | QE |
| --- | --- | --- | --- | --- | --- | --- |
| 52 | 7 | 8 | 56 | 5 | 54.43617 | 0.123941 |
To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’

